Multi-task learning in under-resourced Dravidian languages

Abstract

It is challenging to obtain extensive annotated data for under-resourced languages, so we investigate whether it is beneficial to train models using multi-task learning. Sentiment analysis and offensive language identification share similar discourse properties. The selection of these tasks is motivated by the lack of large labelled user-generated code-mixed datasets. This paper works with code-mixed YouTube comments in the Tamil, Malayalam, and Kannada languages. Our framework is applicable to other sequence classification problems irrespective of the size of the datasets. Experiments show that our multi-task learning model can achieve high results compared to single-task learning while reducing the time and space constraints required to train models on individual tasks. Analysis of the fine-tuned models indicates a preference for multi-task learning over single-task learning, resulting in a higher weighted F1 score on all three languages. We apply two multi-task learning approaches to three Dravidian languages: Kannada, Malayalam, and Tamil. Maximum scores on Kannada and Malayalam were achieved by mBERT subjected to cross-entropy loss with an approach of hard parameter sharing. The best scores on Tamil were achieved by DistilBERT subjected to cross-entropy loss with soft parameter sharing as the architecture type. For sentiment analysis and offensive language identification, the best performing models achieved weighted F1 scores of (66.8%, 90.5%), (59%, 70%), and (62.1%, 75.3%) for Kannada, Malayalam, and Tamil respectively.
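The abstract names hard parameter sharing with a cross-entropy loss but gives no implementation detail. The following is a minimal, forward-pass-only sketch of that idea in plain Python: one shared encoder layer feeds two task-specific heads, and the two cross-entropy losses are added. It is an illustration under stated assumptions, not the authors' code. The class `HardSharingModel`, the helper names, the layer sizes, the class counts, and the equal weighting of the two losses are all hypothetical; in the paper the shared encoder is a pretrained transformer such as mBERT rather than a single dense layer.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, using log-sum-exp for stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


def init_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class HardSharingModel:
    """Toy stand-in (hypothetical): shared encoder plus two task heads."""

    def __init__(self, in_dim, hidden, n_sentiment, n_offensive, seed=0):
        rng = random.Random(seed)
        # Shared parameters: used by both tasks.
        self.enc_w = init_matrix(hidden, in_dim, rng)
        self.enc_b = [0.0] * hidden
        # Task-specific parameters: one output head per task.
        self.sent_w = init_matrix(n_sentiment, hidden, rng)
        self.sent_b = [0.0] * n_sentiment
        self.off_w = init_matrix(n_offensive, hidden, rng)
        self.off_b = [0.0] * n_offensive

    def forward(self, x):
        shared = relu(linear(x, self.enc_w, self.enc_b))
        sentiment_logits = linear(shared, self.sent_w, self.sent_b)
        offensive_logits = linear(shared, self.off_w, self.off_b)
        return sentiment_logits, offensive_logits


def joint_loss(model, x, sentiment_label, offensive_label):
    """Sum of the two task losses (equal weighting is an assumption)."""
    sentiment_logits, offensive_logits = model.forward(x)
    return (cross_entropy(sentiment_logits, sentiment_label)
            + cross_entropy(offensive_logits, offensive_label))


# Hypothetical sizes and labels, chosen only to exercise the code.
model = HardSharingModel(in_dim=4, hidden=3, n_sentiment=5, n_offensive=6)
features = [0.5, -1.0, 0.25, 2.0]
print(joint_loss(model, features, sentiment_label=2, offensive_label=0))
```

Because both heads read the same `shared` representation, a gradient step on the joint loss would update the encoder using signal from both tasks, which is the sense in which the parameters are "hard" shared. Soft parameter sharing, by contrast, keeps a separate encoder per task and regularizes them toward each other; it is not shown here.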

Related articles

Eigentrigraphemes for under-resourced languages

Grapheme-based modeling has an advantage over phone-based modeling in automatic speech recognition for under-resourced languages when a good dictionary is not available. Recently we proposed a new method for parameter estimation of context-dependent hidden Markov model (HMM) called eigentriphone modeling. Eigentriphone modeling outperforms conventional tied-state HMM by eliminating the quantiza...

Joint Bayesian Morphology learning for Dravidian languages

In this paper a methodology for learning the complex agglutinative morphology of some Indian languages using Adaptor Grammars and morphology rules is presented. Adaptor grammars are a compositional Bayesian framework for grammatical inference, where we define a morphological grammar for agglutinative languages and morphological boundaries are inferred from a plain text corpus. Once morphologica...

Useful technique for low-resourced/under-resourced languages

Acoustic modelling for under-resourced languages

Over the past decades research in the field of automatic speech recognition has led to systems with a sufficiently high grade of maturity that makes them suitable for use in real-life applications. However, such recognition systems have been developed only for very few languages. Languages addressed are mainly those with a large population, a high economic power, or for which a high political ...

Multi-Task Learning Using Mismatched Transcription for Under-Resourced Speech Recognition

It is challenging to obtain large amounts of native (matched) labels for audio in under-resourced languages. This could be due to a lack of literate speakers of the language or a lack of universally acknowledged orthography. One solution is to increase the amount of labeled data by using mismatched transcription, which employs transcribers who do not speak the language (in place of native speak...

Journal

Journal title: Journal of Data, Information and Management

Year: 2022

ISSN: 2524-6356, 2524-6364

DOI: https://doi.org/10.1007/s42488-022-00070-w